Population genomics of the neotropical palm Copernicia prunifera (Miller) H. E. Moore: Implications for conservation

Copernicia prunifera (Miller) H. E. Moore is a palm tree native to Brazil. The products obtained from its leaf extracts are a source of income for local families and the agroindustry. Owing to the reduction of natural habitats and the absence of a sustainable management plan, the maintenance of the natural populations of this palm tree has been compromised. Therefore, this study aimed to evaluate the diversity and genetic structure of 14 C. prunifera populations using single nucleotide polymorphisms (SNPs) identified through genotyping-by-sequencing (GBS) to provide information that contributes to the conservation of this species. A total of 1,013 SNP markers were identified, of which 84 loci showed outlier behavior and may reflect responses to natural selection. Overall, the level of genomic diversity was compatible with the biological aspects of this species. The inbreeding coefficient (f) was negative for all populations, indicating excess heterozygotes. Most genetic variations occurred within populations (77.26%), and a positive correlation existed between genetic and geographic distances. The population structure evaluated through discriminant analysis of principal components (DAPC) revealed low genetic differentiation between populations. The results highlight the need for efforts to conserve C. prunifera as well as its distribution range to preserve its global genetic diversity and evolutionary potential.


Introduction
Habitat reduction and deforestation resulting from human activities have had adverse effects on forest populations, contributing to high rates of species extinction, particularly in the Neotropical region [1]. With the exception of some natural areas, most tropical species occur in anthropogenic landscapes, where the previously continuous forest has now been reduced to a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 opportunity to evaluate local adaptation patterns; neutral loci are similarly affected by the demographic and evolutionary history of populations [27].
Genetic diversity studies based on molecular markers of natural populations of C. prunifera in tropical areas such as Caatinga and Restinga are still scarce [17,[28][29][30]. In addition, no studies on C. prunifera have applied next-generation sequencing technology for data acquisition in population genomics. Due to the importance of this neotropical palm tree for local communities and considering the rapid and recent increases in the exploitation of its populations, the present study employed next-generation sequencing to evaluate the genetic diversity and structure of 14 natural populations of C. prunifera in two environments (Caatinga and Restinga) in Brazil using SNP markers to provide information that can help in the design of efficient strategies for the conservation and sustainable use of this species.

Plant material and DNA extraction
In the present study, 160 individual plants from 14 populations of C. prunifera were evaluated. Out of the samplings collected, 10 populations came from the Caatinga (RUS, LGP, SER,  S1  Table). The distance between the plants evaluated within the 14 populations was 15-20 m, with a minimum height of 6-10 m; regenerating and young plants were not collected. The IPG population is composed of a different type of carnaúba, known as "white carnaúba," which is phenotypically distinct from the "common carnaúba" due to the presence of a light stipe, smaller fruits, and the absence of thorns in the petiole in addition to limited occurrence in the region [14].
Small pieces of leaves were cut using a tree trimmer, placed in plastic tubes containing 2 mL of hexadecyltrimethylammonium bromide (CTAB 2X), labeled, and stored in a freezer at -20˚C until DNA extraction. This study was conducted according to the recommendations of the Brazilian Ministry of the Environment and registered in the National System of Management of Genetic Heritage and Associated Traditional Knowledge (SISGEN; Sistema Nacional de Gestão do Patrimônio Genético e do Conhecimento Tradicional Associado) with the number A411583.
Genomic DNA was extracted from the processed leaves according to the protocol described by Doyle and Doyle [31]. DNA quality was evaluated in a 1% agarose gel and stained with SYBR Safe™ (Life Technologies Corporation) for visualization under ultraviolet light, using lambda phage DNA of known concentrations as a reference. Quantification of the samples was performed using a Qubit 3.0 fluorometer with the dsDNA BRKitt (Life Technologies), and the DNA was standardized to a concentration of 30 ng.μl -1

GBS library preparation and high-performance sequencing
To obtain SNPs, genomic libraries were developed using the genotyping-by-sequencing technique (GBS) with two restriction enzymes, according to the protocol described by Poland et al. [32] with modifications. First, 7 μl of genomic DNA from each sample was digested at 37˚C for 12 h using the restriction enzymes NsiI and MspI. Subsequently, 0.02 μM barcode-specific adapters for Illumina technology were ligated to the ends of the digested fragments. Binding reaction was performed at 22˚C for 2 h, 65˚C for 20 min, 10˚C indefinitely. After the adapters were ligated, the samples were purified using a QIAquick PCR Kit (Qiagen). The library was enriched by PCR (Polymerase Chain Reaction) using the following amplification program: 95˚C for 30 s, followed by 16 cycles of 95˚C for 10 s, 62˚C for 20 s, and 72˚C for 30 s, and ending at 72˚C for 5 min. Finally, the library was purified using a QIAgen 1 QIAquick PCR Purification Kit. The Agilent DNA 12000 kit and Agilent 1 2100 Bioanalyzer System were used to verify the average size of the DNA fragments. Sequencing was performed using the Illumina 1 HiSeq 2500 Mid Output Kit v4 (50 cycles) (Illumina Inc., San Diego, CA, USA) in a single-end configuration.

Identification of SNPs
The identification of SNPs was performed using Stacks software v.1.42 [33,34]. The first step comprised filtering and demultiplexing with the process_ radtag module. In the absence of a reference genome for C. prunifera, the DeNovo Stacks pipeline was used, starting with the ustacks module to identify putatively homologous read stacks (putative loci). This step was performed for each sample separately using the following parameters: minimum stack depth (-m = 3) and maximum distance between stacks (-M = 2). The loci of each sample were grouped into a catalog using the cstacks module, allowing a maximum distance of two nucleotides (-n 2) between the loci of each sample. Loci with lower probability values (-lnl_lim -10) were eliminated using the rxstacks correction module. Finally, the population module was used to filter the SNP markers using the following parameters: only one marker per sequenced tag, frequency of least frequent allele (MAF � 0.01), minimum stack depth 3X, and minimum occurrence in 75% of saplings in each population.

Loci determination under selection
Two complementary tests were performed, pcadapt and fsthet, were performed to detect outlier loci (hypothetically under selection). The pcadapt method [35] was used to identify loci associated with the genetic structure revealed by a principal component analysis (PCA), that is, without any underlying genetic model. The analysis was performed using the pcadapt package [35] on the R platform [36] by retaining the first eight principal components of the PCA and considering the loci with q-values � 0.1 as outlier SNPs. The fsthet method [37] was used to identify loci with F ST values that were excessively high or low compared with what was expected under neutrality. The analysis was performed using the fsthet package [37] on the R platform [36] by considering the loci below or above the 95% confidence intervals constructed with 1000 bootstraps for the expected relationship between H E and F ST as outlier SNPs. This test was performed by considering the estimates of F ST in two different scenarios: i) comparing the Restinga and Caatinga populations and ii) comparing the samples of the white morphotype with those of conventional morphotype. The final set of SNP markers hypothetically under selection consisted of loci identified as outliers in at least two of the three tests performed. Thus, the outlier SNP loci may reflect the action of selection on different types of vegetation.
Sequences containing outlier SNPs were searched with the BLASTX tool against the genomic data set of the National Center for Biotechnology Information (NCBI) using blast2go [38]. This analysis was performed to identify the similarities between the protein-coding data deposited in the NCBI database and the loci with outlier SNPs identified. For sequences with significant BLASTX hits, the functional annotation associated with characterized and/or described coding sequences was performed using the gene ontology system (GO terms). GO terms summarize information on cellular components, molecular functions, and biological processes in which the gene products are involved.

Population genomic analyses
Genetic diversity was estimated based on the number of alleles (A), number of private alleles (Ap), observed heterozygosity (H O ), and expected heterozygosity (H E ). Inbreeding coefficients (f) were also estimated, and their confidence intervals were obtained using 1000 bootstraps. Estimates of diversity and inbreeding were obtained using the diveRsity [39] and PoPPr [40] packages of the R software [36]. The distribution of genetic variation within and between populations of C. prunifera was evaluated using analysis of molecular variance (AMOVA), and its significance was tested with 10,000 permutations using the PoPPr program [40].
Genetic differentiation was estimated using pairwise F ST values with confidence intervals of 1000 bootstraps, using the diveRsity package [39] of R software [36]. The population structure was evaluated using discriminant analysis of principal components (DAPC) with Adegenet [41,42] for R software [36]. This analysis was performed for neutral loci, and priori groups were defined from the 14 sampling sites. DAPC does not presuppose the underlying population genetic processes (e.g., binding equilibrium and Hardy-Weinberg equilibrium) common to other methods used to detect population structure, and as it is based on principal component analysis, this method can analyze genomic datasets relatively efficiently [43].
Genetic relationships and divergence between individuals were investigated by constructing a dendrogram generated based on the distance of Nei using the neighbor-joining method [44]. The final dendrograms were formatted using MEGA version 7 [45].

Identification of SNPs and determination of loci under selection
Sequencing of the genomic libraries resulted in 566,922,165 reads, and after quality control, the total number of reads retained was 397,047,980. In total, 1,013 SNPs (average depth of 21X) were identified. A total of 391 outlier SNPs were identified using the pcadapt method, 70 using the fsthet method to compare morphotypes, and 47 using the fsthet method to compare vegetation types (Fig 3). Of these, 84 SNP markers were identified in at least two of the three tests and were considered hypothetically under selection, whereas the other 929 markers were considered as neutral loci. Among the outlier loci, 55 were putatively under positive selection and 29 were putatively under balancing selection. Only six outlier loci were found in the sequences similar to the annotated proteins (S2 Table). Considering the results of the GO terms, the most frequent annotations for these proteins were the molecular functions of "binding" and "catalytic activity," and the biological process of cell metabolism (S1 Fig).

Population genomic analyses
The genomic diversity estimates were based on 929 neutral SNP markers. The number of alleles (A) ranged from 1,059 to 1,497. The IPG population (white carnaúba) had the lowest number of alleles, probably because of the small sample size of the population. Expected heterozygosity (H E ) ranged from 0.201 to 0.265 (Table 1). The APD population had higher genetic diversity (H E = 0.265) and the highest number of private alleles (Ap = 40) compared with that of the other populations. The inbreeding coefficients (f) were similar and negative for all populations, indicating an excess of heterozygotes.
F ST values 0-0.05 and 0.05-0.15 indicate low and moderate genetic differentiation, respectively, whereas values > 0.15 indicate high differentiation (Hartl, Clark 1997). In the present study, F ST estimates suggested low to high genetic differentiation between populations of C. prunifera (Table 2). In general, there was a greater differentiation between the population from MACE and those from the other sites (F ST ranged from 0.118 to 0.20). In addition, the SMG and IPG populations showed moderate levels of differentiation. A low genetic structure was observed for the populations from LGP and SER (0.18) and AR1 and AR2 (0.19), suggesting a genetic flow between these localities.
The low genetic divergence suggested by the pairwise F ST was also observed in DAPC, which retained 28.7% of the total variation in the first two principal components (Fig 4). This analysis also showed greater genetic differentiation of the population from MACE in comparison with that of others in addition to pointing out an overlap between individuals from almost all populations, especially AR1, AR2, and RUS.  Analysis of molecular variance (AMOVA) indicated that most of the variation was found within populations (77.26%), and the genetic differentiation between populations was high and significant (φ = 0.227) ( Table 3). The Mantel test revealed a positive and significant correlation between the geographical and genetic distances based on the F ST values (r = 0.0612; p = 0.002).
According to the dendrogram (Fig 5), the MACE population was the most genetically distant, corroborating the results observed in the DAPC population. Individual saplings from the LGP and SER populations exhibited similar levels of genetic similarity. In addition, there was a    Table 4).
The populations from the Caatinga had the largest number of private alleles (Ap = 253) compared to the Restinga populations and this result is probably associated with the sample size. The inbreeding coefficients (f) are both similar and negative. In addition, the F ST estimates suggested low genetic differentiation between the Caatinga and Restinga populations (F ST = 0.008). When considering only two morphotypes of C. prunifera (white carnaúba and common carnaúba), similarities were observed in the estimates of diversity in addition to low genetic differentiation (F ST = 0.008) ( Table 4). Analysis of molecular variance (AMOVA) among the vegetation types (Caatinga and Restinga) produced small genetic differentiation (φ = 0.024). It revealed 97.522% of the genetic variation within the vegetation types whereas, 2.478% of the total genetic variation was observed between types of vegetation (Table 5).

Loci putatively under selection
The large number of SNP markers obtained in this study allowed for the identification of loci with deviations from the expected neutral behavior, which are putatively under selection  (outlier loci). The identification of outlier loci is an important step in understanding local adaptation and evaluating the evolutionary potential of a species [46]. The palm tree C. prunifera has no annotated reference genome, and probably for this reason, most sequences with outlier loci are similar to uncharacterized proteins. Regarding the results obtained from the annotation, most loci are associated with genes involved in metabolic processes, which have been regularly found under selection in a variety of organisms because the gene functionality correlates with environmental stressors [47]. Interestingly, some annotated loci were associated with genes of transposable elements (S2 Table). According to Gogvadze and Buzdin [48], transposable elements promote changes in the genome, which is an important evolutionary mechanism for the adaptation of organisms to changes in environmental conditions. This is expected in C. prunifera because the palm trees grow in different environments such as seasonally flooded areas in the semi-arid region [49]. In addition, outlier loci may be associated with environmental differences in the collection sites, especially as sampling areas are scattered over the Restinga and Caatinga.
It is important to highlight that the analyses performed in this study are unable to indicate associations between genomic and functional variation; therefore, it is not possible to associate generic molecular functions or biological processes with any adaptive traits involved in the diversification of the evaluated populations. Therefore, studies with larger sample sizes with better representation of the different geographical habitats are needed to generate information on the evolution and diversification of C. prunifera. Small sample sizes belonging to populations with relatively small geographical distances, which enable gene flow to quickly spread new adaptations to surrounding areas, reduce the capacity to detect recent evolutionary changes [50]. However, the identified outlier loci can be used as candidates in association mapping studies. Thus, integrative approaches of association genetics, genome-wide scans, and measurements of phenotype selection are necessary to understand the adaptive nature of a given allele [51].

Genetic diversity, inbreeding, and structure
Genetic diversity is one of the three classes of biodiversity recognized as a global conservation priority and plays a decisive role in conservation efforts. Genetic diversity has a substantial effect on both individual fitness and the adaptive capacity of the population, playing a vital role in maintaining the capacity of species to withstand various biotic and abiotic stressors and evolve under altered environmental conditions [52]. The present study provides the first estimates based on SNPs for genetic diversity in C. prunifera. The GBS approach used in this study produced a large number of SNP loci for the genomic evaluation of this palm tree without the need for a reference genome. This has resulted in robust estimates of diversity and patterns of genetic structure.
The results of genetic diversity and population structure were similar based on the results of the analysis according to population (among the 14 localities), type of vegetation (Caatinga and Restinga), and morphotype (common carnaúba and white carnaúba). In all situations, the populations showed a negative f value, suggesting limited inbreeding with reduced self- pollination capacity under environmental conditions. Therefore, individual plants are less related than expected under conditions of random mating. Genetic diversity and population structure is influenced by biological characteristics of the species, including the mating system [53]. Therefore, the reproductive biology of this species may explain the observed patterns of genetic variation. The mating system of C. prunifera is mixed and preferably allogamous [5], which favors the crossing between unrelated individuals. Thus, inbreeding coefficients are reduced, and the maintenance of genetic diversity within populations is ensured. Although the evaluated populations were susceptible to anthropogenic threats, it is possible that they had high genetic diversity (H E ). High levels of genetic diversity led to an increase in long-term survival of a species; therefore, a strong positive correlation exists between heterozygosity and population fitness, which is important for populations to adapt to new environmental conditions [54]. This high level of diversity is expected in forest species that are largely not domesticated as a result of local adaptation and neutral evolutionary processes in heterogeneous environments [22].
The identification of private alleles is useful for genetic conservation [55]. In the present study, the populations from APD (Ap = 40), AR1 (Ap = 35), and MOS (Ap = 28) had the highest number of private alleles and diversity was not found in the other localities; therefore, these populations deserve special management because the levels of private alleles are indicative of individual fitness and explain the evolutionary potential of populations and their ability to adapt to the adverse environmental conditions [21]. Therefore, this information can be used to increase the genetic representation in germplasm banks. and to convey the need to explore seed collection in situ to ensure future replacement.
Genetic variation in plant species is strongly affected by several historical and demographic factors, including geographic distribution, life form, and population size [56]. The results of AMOVA showed that most of the genetic diversity was found within C. prunifera populations (Table 3). Similarly, Santos et al. [17] analyzed the genetic differentiation of this palm tree in the northeast region of Brazil and found that 62.86% of molecular variance was accounted for by differences within populations. These results agree with those of different studies conducted on forest species that reproduce by allogamy, seeing as these species have maintained most of their genetic variability within populations [57].
Genetic structure analyses indicated that the 14 collection sites did not belong to a single homogeneous population, and the geographically closest populations showed low values of pairwise F ST and overlap in the DAPC. Greater genetic similarity was found between the populations from LGP and SER and between AR1 and AR2. In addition, low genetic differentiation was observed when the populations were evaluated according to vegetation type and morphotype.
The low level of global genetic differentiation found between the populations studied here (supported by F ST , cluster analysis, and DAPC) and the higher proportion of genetic diversity within populations with only fewer partitions between them could result from the combined effect of different factors, such as cross rate, reproductive system, and high genetic flow rate in this species. The Mantel test corroborates this result. Since geographically close populations tend to be genetically similar, this indicates a pattern of isolation by distance. However, the MACE population had the lowest level of diversity (H E = 0.201) and the highest degree of structuring, being the most genetically divergent population compared with the others. This differentiation was supported by the F ST value, which is an indirect estimator of the population connectivity between subpopulations ( Table 2). The population from MACE corresponds to a small population with the lowest number of plants in an area of approximately 0.9 hectares, and spatially isolated from other populations. Furthermore, anthropogenic factors are observed in higher intensity in this population, this fact is probably related to the intense exploitation of carnauba wax in this area [14]. These factors can lead to a reduction in genetic diversity within population, likely as a result of genetic drift [58]. Santos et al. [17], using ISSR markers, also indicated that MACE population has a high differentiation, and genetic discontinuities were observed between this population and the others, with indications of a recent genetic bottleneck. Therefore, the impact of human activities may have contributed to the levels of genetic differentiation observed in the MACE population.

Implications for conservation
Conservation genomics is an extension of conservation genetics that seeks to apply genomic techniques to the practical management of natural populations [59]. In this context, evaluations of genetic variations in the entire genome are powerful approaches to gain an understanding of the processes that lead to molecular diversification and inform effective management and conservation strategies [60]. However, application in real-time has been slow and a persistent gap exists between theory and practice.
In Brazil, the legislation that guides forest management does not clearly describe the importance of genetic evaluation within natural populations; therefore, information that seeks to associate genetic data with the formulation of sustainable management plans is unfortunately not mentioned [61]. Although C. prunifera is not listed as an endangered species, the expansion of agricultural activities over time has contributed to a reduction in its natural population [17]. Therefore, conservation measures are necessary to minimize the additional loss of alleles and to ensure the maintenance of genetic resources.
Conserving genetic diversity within a population should be the cornerstone of any conservation strategy aimed at ensuring the long-term persistence of species and habitats [62]. In situ and ex situ conservation strategies are considered promising alternatives for the conservation of forest genetic resources (FGR) and aim to maintain the genetic diversity of species over time, preserving the evolutionary processes and adaptive potential of populations [63]. Although ex situ approaches have the potential to conserve much of the biological diversity, they do have a limitation of being more suited and efficient for conservation in plants that have orthodox seeds. Therefore, in situ conservation is recommended for C. prunifera because this species contains recalcitrant seeds [64]. However, active management, including the establishment of in vivo seed banks and the promotion of natural regeneration, can prevent the decrease of population size, loss of genetic variability, and ensure long-term conservation [17].
The high genetic diversity observed in the evaluated populations of C. prunifera indicates the need for large areas of land dedicated to in situ conservation for capturing the existing genetic diversity of these populations. F ST values estimated in the present study could help in recommending the optimal number of populations for sampling, including populations that had the highest estimates of diversity and the largest number of private alleles.
C. prunifera exploitation is an important source of employment and income for local communities in the semi-arid region of Brazil. In this context, the rational management of palm tree products should be a principal strategy in the efforts to conserve the natural habitats of the species. Another strategy aimed at conservation and sustainable use would be the development of a community and family forest management (CFFM) plan [65], which consists of the planning and management of actions and appropriate techniques for the sustainable use of forest resources aimed at traditional communities and family farmers [66].
In addition, practical measures aimed at successful plant regeneration, such as the pause of extractive activity during reproductive periods and the introduction of rotation cycles for leaf harvesting in the explored areas, need to be implemented for the sustainable management of carnaúba. However, the current social and economic conditions of workers employed in the activity of extraction and production of carnaúba wax must be considered. Workers in poorer areas need to be provided additional support, including investments, to maintain the balance between socioeconomic demand and conservation, which would pave the way to a more sustainable supply of resources while reducing the pressure of uncontrolled harvesting.
A third approach would be to preserve populations and divergent genetic groups identified in this study throughout their geographic distribution range through effective long-term genetic and ecological monitoring, stimulating the development of ecological corridors between fragments and natural forests, and avoiding the reduction of genetic variability. In addition, interdisciplinary programs that study different aspects of C. prunifera populations (e.g., habitat quality, impact of extractive activity on individuals, and genetic diversity) throughout their distribution would be fundamental for the successful implementation of species conservation management.